% applyrcu/applyrcu.tex

\QuickQuizChapter{chp:applyrcu:Applying RCU}{Applying RCU}

This chapter shows how to apply RCU to some examples discussed earlier
in this book.
In some cases, RCU provides simpler code, in other cases better
performance and scalability, and in still other cases, both.

\section{RCU and Per-Thread-Variable-Based Statistical Counters}
\label{sec:applyrcu:RCU and Per-Thread-Variable-Based Statistical Counters}

Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}
described an implementation of statistical counters that provided
excellent
performance, roughly that of simple increment (as in the C \co{++}
operator), and linear scalability --- but only for incrementing
via \co{inc_count()}.
Unfortunately, threads needing to read out the value via \co{read_count()}
were required to acquire a global
lock, and thus incurred high overhead and suffered poor scalability.
The code for the lock-based implementation is shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
Page~\pageref{fig:count:Per-Thread Statistical Counters}.

\QuickQuiz{}
	Why on earth did we need that global lock in the first place?
\QuickQuizAnswer{
	A given thread's \co{__thread} variables vanish when that
	thread exits.
	It is therefore necessary to synchronize any operation that
	accesses other threads' \co{__thread} variables with
	thread exit.
	Without such synchronization, accesses to \co{__thread} variable
	of a just-exited thread will result in segmentation faults.
} \QuickQuizEnd

\subsection{Design}

The hope is to use RCU rather than \co{final_mutex} to protect the
thread traversal in \co{read_count()} in order to obtain excellent
performance and scalability from \co{read_count()}, rather than just
from \co{inc_count()}.
However, we do not want to give up any accuracy in the computed sum.
In particular, when a given thread exits, we absolutely cannot
lose the exiting thread's count, nor can we double-count it.
Such an error could result in inaccuracies equal to the full
precision of the result, in other words, such an error would
make the result completely useless.
And in fact, one of the purposes of \co{final_mutex} is to
ensure that threads do not come and go in the middle of \co{read_count()}
execution.

\QuickQuiz{}
	Just what is the accuracy of \co{read_count()}, anyway?
\QuickQuizAnswer{
	Refer to
	Figure~\ref{fig:count:Per-Thread Statistical Counters} on
	Page~\pageref{fig:count:Per-Thread Statistical Counters}.
	Clearly, if there are no concurrent invocations of \co{inc_count()},
	\co{read_count()} will return an exact result.
	However, if there \emph{are} concurrent invocations of
	\co{inc_count()}, then the sum is in fact changing as
	\co{read_count()} performs its summation.
	That said, because thread creation and exit are excluded by
	\co{final_mutex}, the pointers in \co{counterp} remain constant.

	Let's imagine a mythical machine that is able to take an
	instantaneous snapshot of its memory.
	Suppose that this machine takes such a snapshot at the
	beginning of \co{read_count()}'s execution, and another
	snapshot at the end of \co{read_count()}'s execution.
	Then \co{read_count()} will access each thread's counter
	at some time between these two snapshots, and will therefore
	obtain a result that is bounded by those of the two snapshots,
	inclusive.
	The overall sum will therefore be bounded by the pair of sums that
	would have been obtained from each of the two snapshots (again,
	inclusive).

	The expected error is therefore half of the difference between
	the pair of sums that would have been obtained from each of the
	two snapshots, that is to say, half of the execution time of
	\co{read_count()} multiplied by the number of expected calls to
	\co{inc_count()} per unit time.

	Or, for those who prefer equations:
	\begin{equation}
	\epsilon = \frac{T_r R_i}{2}
	\end{equation}
	where $\epsilon$ is the expected error in \co{read_count()}'s
	return value,
	$T_r$ is the time that \co{read_count()} takes to execute,
	and $R_i$ is the rate of \co{inc_count()} calls per unit time.
	(And of course, $T_r$ and $R_i$ should use the same units of
	time: microseconds and calls per microsecond, seconds and calls
	per second, or whatever, as long as they are the same units.)
} \QuickQuizEnd

Therefore, if we are to dispense with \co{final_mutex}, we will need
to come up with some other method for ensuring consistency.
One approach is to place the total count for all previously exited
threads and the array of pointers to the per-thread counters into a single
structure.
Such a structure, once made available to \co{read_count()}, is
held constant, ensuring that \co{read_count()} sees consistent data.

\subsection{Implementation}

\begin{figure}[bp]
{ \scriptsize
\begin{verbatim}
  1 struct countarray {
  2   unsigned long total;
  3   unsigned long *counterp[NR_THREADS];
  4 };
  5 
  6 long __thread counter = 0;
  7 struct countarray *countarrayp = NULL;
  8 DEFINE_SPINLOCK(final_mutex);
  9 
 10 void inc_count(void)
 11 {
 12   counter++;
 13 }
 14 
 15 long read_count(void)
 16 {
 17   struct countarray *cap;
 18   unsigned long sum;
 19   int t;
 20 
 21   rcu_read_lock();
 22   cap = rcu_dereference(countarrayp);
 23   sum = cap->total;
 24   for_each_thread(t)
 25     if (cap->counterp[t] != NULL)
 26       sum += *cap->counterp[t];
 27   rcu_read_unlock();
 28   return sum;
 29 }
 30 
 31 void count_init(void)
 32 {
 33   countarrayp = malloc(sizeof(*countarrayp));
 34   if (countarrayp == NULL) {
 35     fprintf(stderr, "Out of memory\n");
 36     exit(-1);
 37   }
 38   memset(countarrayp, '\0', sizeof(*countarrayp));
 39 }
 40 
 41 void count_register_thread(void)
 42 {
 43   int idx = smp_thread_id();
 44 
 45   spin_lock(&final_mutex);
 46   countarrayp->counterp[idx] = &counter;
 47   spin_unlock(&final_mutex);
 48 }
 49 
 50 void count_unregister_thread(int nthreadsexpected)
 51 {
 52   struct countarray *cap;
 53   struct countarray *capold;
 54   int idx = smp_thread_id();
 55 
 56   cap = malloc(sizeof(*countarrayp));
 57   if (cap == NULL) {
 58     fprintf(stderr, "Out of memory\n");
 59     exit(-1);
 60   }
 61   spin_lock(&final_mutex);
 62   *cap = *countarrayp;
 63   cap->total += counter;
 64   cap->counterp[idx] = NULL;
 65   capold = countarrayp;
 66   rcu_assign_pointer(countarrayp, cap);
 67   spin_unlock(&final_mutex);
 68   synchronize_rcu();
 69   free(capold);
 70 }
\end{verbatim}
}
\caption{RCU and Per-Thread Statistical Counters}
\label{fig:applyrcu:RCU and Per-Thread Statistical Counters}
\end{figure}

Lines~1-4 of
Figure~\ref{fig:applyrcu:RCU and Per-Thread Statistical Counters}
show the \co{countarray} structure, which contains a
\co{->total} field for the count from previously exited threads,
and a \co{counterp[]} array of pointers to the per-thread
\co{counter} for each currently running thread.
This structure allows a given execution of \co{read_count()}
to see a total that is consistent with the indicated set of running
threads.

Lines~6-8 contain the definition of the per-thread \co{counter}
variable, the global pointer \co{countarrayp} referencing
the current \co{countarray} structure, and
the \co{final_mutex} spinlock.

Lines~10-13 show \co{inc_count()}, which is unchanged from
Figure~\ref{fig:count:Per-Thread Statistical Counters}.

Lines~15-29 show \co{read_count()}, which has changed significantly.
Lines~21 and 27 substitute \co{rcu_read_lock()} and
\co{rcu_read_unlock()} for acquisition and release of \co{final_mutex}.
Line~22 uses \co{rcu_dereference()} to snapshot the
current \co{countarray} structure into local variable \co{cap}.
Proper use of RCU will guarantee that this \co{countarray} structure
will remain with us through at least the end of the current RCU
read-side critical section at line~27.
Line 23 initializes \co{sum} to \co{cap->total}, which is the
sum of the counts of threads that have previously exited.
Lines~24-26 add up the per-thread counters corresponding to currently
running threads, and, finally, line 28 returns the sum.

The initial value for \co{countarrayp} is
provided by \co{count_init()} on lines~31-39.
This function runs before the first thread is created, and its job
is to allocate
and zero the initial structure, and then assign it to \co{countarrayp}.

Lines~41-48 show the \co{count_register_thread()} function, which
is invoked by each newly created thread.
Line~43 picks up the current thread's index, line~45 acquires
\co{final_mutex}, line~46 installs a pointer to this thread's
\co{counter}, and line~47 releases \co{final_mutex}.

\QuickQuiz{}
	Hey!!!
	Line~45 of
	Figure~\ref{fig:applyrcu:RCU and Per-Thread Statistical Counters}
	modifies a value in a pre-existing \co{countarray} structure!
	Didn't you say that this structure, once made available to
	\co{read_count()}, remained constant???
\QuickQuizAnswer{
	Indeed I did say that.
	And it would be possible to make \co{count_register_thread()}
	allocate a new structure, much as \co{count_unregister_thread()}
	currently does.

	But this is unnecessary.
	Recall the derivation of the error bounds of \co{read_count()}
	that was based on the snapshots of memory.
	Because new threads start with initial \co{counter} values of
	zero, the derivation holds even if we add a new thread partway
	through \co{read_count()}'s execution.
	So, interestingly enough, when adding a new thread, this
	implementation gets the effect of allocating a new structure,
	but without actually having to do the allocation.
} \QuickQuizEnd

Lines~50-70 shows \co{count_unregister_thread()}, which is invoked
by each thread just before it exits.
Lines~56-60 allocate a new \co{countarray} structure,
line~61 acquires \co{final_mutex} and line~67 releases it.
Line~62 copies the contents of the current \co{countarray} into
the newly allocated version, line~63 adds the exiting thread's \co{counter}
to new structure's total, and line~64 \co{NULL}s the exiting thread's
\co{counterp[]} array element.
Line~65 then retains a pointer to the current (soon to be old)
\co{countarray} structure, and line~66 uses \co{rcu_assign_pointer()}
to install the new version of the \co{countarray} structure.
Line~68 waits for a grace period to elapse, so that any threads that
might be concurrently executing in \co{read_count}, and thus might
have references to the old \co{countarray} structure, will be allowed
to exit their RCU read-side critical sections, thus dropping any such
references.
Line~69 can then safely free the old \co{countarray} structure.

\subsection{Discussion}

\QuickQuiz{}
	Wow!
	Figure~\ref{fig:applyrcu:RCU and Per-Thread Statistical Counters}
	contains 69 lines of code, compared to only 42 in
	Figure~\ref{fig:count:Per-Thread Statistical Counters}.
	Is this extra complexity really worth it?
\QuickQuizAnswer{
	This of course needs to be decided on a case-by-case basis.
	If you need an implementation of \co{read_count()} that
	scales linearly, then the lock-based implementation shown in
	Figure~\ref{fig:count:Per-Thread Statistical Counters}
	simply will not work for you.
	On the other hand, if calls to \co{count_read()} are sufficiently
	rare, then the lock-based version is simpler and might thus be
	better, although much of the size difference is due
	to the structure definition, memory allocation, and \co{NULL}
	return checking.

	Of course, a better question is "why doesn't the language
	implement cross-thread access to \co{__thread} variables?"
	After all, such an implementation would make both the locking
	and the use of RCU unnecessary.
	This would in turn enable an implementation that
	was even simpler than the one shown in
	Figure~\ref{fig:count:Per-Thread Statistical Counters}, but
	with all the scalability and performance benefits of the
	implementation shown in
	Figure~\ref{fig:applyrcu:RCU and Per-Thread Statistical Counters}!
} \QuickQuizEnd

Use of RCU enables exiting threads to wait until other threads are
guaranteed to be done using the exiting threads' \co{__thread} variables.
This allows the \co{read_count()} function to dispense with locking,
thereby providing
excellent performance and scalability for both the \co{inc_count()}
and \co{read_count()} functions.
However, this performance and scalability come at the cost of some increase
in code complexity.
It is hoped that compiler and library writers employ user-level
RCU~\cite{MathieuDesnoyers2009URCU} to provide safe cross-thread
access to \co{__thread} variables, greatly reducing the
complexity seen by users of \co{__thread} variables.

\section{RCU and Counters for Removable I/O Devices}
\label{sec:applyrcu:RCU and Counters for Removable I/O Devices}

Section~\ref{sec:count:Applying Specialized Parallel Counters}
showed a fanciful pair of code fragments for dealing with counting
I/O accesses to removable devices.
These code fragments suffered from high overhead on the fastpath
(starting an I/O) due to the need to acquire a reader-writer
lock.

This section shows how RCU may be used to avoid this overhead.

The code for performing an I/O is quite similar to the original, with
an RCU read-side critical section be substituted for the reader-writer
lock read-side critical section in the original:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
  1 rcu_read_lock();
  2 if (removing) {
  3   rcu_read_unlock();
  4   cancel_io();
  5 } else {
  6   add_count(1);
  7   rcu_read_unlock();
  8   do_io();
  9   sub_count(1);
 10 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

The RCU read-side primitives have minimal overhead, thus speeding up
the fastpath, as desired.

The updated code fragment removing a device is as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
  1 spin_lock(&mylock);
  2 removing = 1;
  3 sub_count(mybias);
  4 spin_unlock(&mylock);
  5 synchronize_rcu();
  6 while (read_count() != 0) {
  7   poll(NULL, 0, 1);
  8 }
  9 remove_device();
\end{verbatim}
\end{minipage}
\vspace{5pt}

Here we replace the reader-writer lock with an exclusive spinlock and
add a \co{synchronize_rcu()} to wait for all of the RCU read-side
critical sections to complete.
Because of the \co{synchronize_rcu()},
once we reach line~6, we know that all remaining I/Os have been accounted
for.

Of course, the overhead of \co{synchronize_rcu()} can be large,
but given that device removal is quite rare, this is usually a good
tradeoff.
